This is the first report for a data science project on word prediction using NLP. The ultimate goal of the project is to write an efficient algorithm that uses n-grams (contiguous sequences of n tokens extracted from the tokenized text) to predict the word most likely to follow a given phrase.
In this report, I explore three text files, containing text from blogs, news articles, and Twitter.
After background research, I decided to use quanteda and data.table for the main calculations, for better performance.
For the initial visualisation, the wordcloud package seemed the most effective.
library(quanteda)
## Package version: 1.5.1
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
library(spacyr)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(data.table)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# The functions below tokenize the given text into n-grams of order 'ng'
# and return the frequency of each n-gram.
# what = "fasterword" / what = "fastestword" is not used because the cleaning
# measures are still incomplete.
getFreqs = function(dat, ng) {
  # build a document-feature matrix of n-grams, dropping punctuation,
  # numbers, and English stopwords
  dat.dfm = dfm(dat, ngrams = ng, remove_punct = T, remove_numbers = T,
                remove = stopwords("english"))
  # document frequency of each n-gram, sorted alphabetically by name
  dat.freq = docfreq(dat.dfm)
  dat.freq = dat.freq[sort(names(dat.freq))]
  return(dat.freq)
}
getTables = function(dat, ng) {
  # wrap the frequencies in a data.table for fast ordering and lookup
  ngrams = getFreqs(dat = dat, ng = ng)
  ngrams_dt = data.table(ngram = names(ngrams), freq = ngrams)
  return(ngrams_dt)
}
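As a quick illustration of the helpers (a toy example of my own, not part of the report's data), they can be run on a couple of short strings:

```r
# Illustrative only: bigram frequencies of two toy sentences.
# With English stopwords removed, "the" and "on" are dropped before
# the bigrams are formed.
toy <- c("the cat sat on the mat", "the dog sat on the rug")
getTables(dat = toy, ng = 2)
# returns a data.table with columns 'ngram' (e.g. "cat_sat") and 'freq'
```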
The dataset can be downloaded from a link on the course website. [https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip]
The unzipped archive contains a directory called final with a subdirectory called en_US, which contains the texts.
There are 3 text files:
* en_US.blogs.txt - text from blog posts
* en_US.news.txt - text from news articles
* en_US.twitter.txt - tweets from Twitter
The goal here is to display the initial data exploration and to show that I am on track to establishing my algorithm.
numwords <- system("wc -w *.txt", intern=TRUE) # intern=TRUE to return output
numlines <- system("wc -l *.txt", intern=TRUE)
numbytes <- system("wc -c *.txt", intern=TRUE)
# number of words for each dataset
blog.numwords <- as.numeric(gsub('[^0-9]', '', numwords[1]))
news.numwords <- as.numeric(gsub('[^0-9]', '', numwords[2]))
twit.numwords <- as.numeric(gsub('[^0-9]', '', numwords[3]))
# number of lines for each dataset
blog.numlines <- as.numeric(gsub('[^0-9]', '', numlines[1]))
news.numlines <- as.numeric(gsub('[^0-9]', '', numlines[2]))
twit.numlines <- as.numeric(gsub('[^0-9]', '', numlines[3]))
# number of bytes for each dataset
blog.numbytes <- as.numeric(gsub('[^0-9]', '', numbytes[1]))
news.numbytes <- as.numeric(gsub('[^0-9]', '', numbytes[2]))
twit.numbytes <- as.numeric(gsub('[^0-9]', '', numbytes[3]))
words = rbind(blog.numwords, news.numwords, twit.numwords)
lines = rbind(blog.numlines, news.numlines, twit.numlines)
bytes = rbind(blog.numbytes, news.numbytes, twit.numbytes)
#summary
data.frame(words = words, lines= lines, Mb = bytes/1000000,
row.names = c("blog", "news", "twit"))
## words lines Mb
## blog 37334690 899288 210.1600
## news 34372720 1010242 205.8119
## twit 30374206 2360148 167.1053
Because of memory limitations, only a random sample of roughly 15% of the entries from each dataset is used in the calculations.
Twitter text
con <- file("./en_US.twitter.txt", "r")
twit = readLines(con, skipNul = T)
close(con)
# taking ~15% of the data for memory reasons
set.seed(1)
x = sample(2360148, 350000, replace = F)
train = twit[x]
TW <- corpus(train)
rm(con, train, twit, x)
summary(TW, 5)
## Corpus consisting of 350000 documents, showing 5 documents:
##
## Text Types Tokens Sentences
## text1 25 26 1
## text2 17 17 1
## text3 12 13 2
## text4 16 17 2
## text5 16 18 1
##
## Source: /Users/Nuray/Desktop/CourseraProjects/capstone/* on x86_64 by Nuray
## Created: Sat Nov 9 19:09:53 2019
## Notes:
Blog text
con <- file("./en_US.blogs.txt", "r")
blog = readLines(con, skipNul = T)
close(con)
# taking ~15% of the data for memory reasons
set.seed(12345)
x = sample(899288, 135000, replace = F)
train = blog[x]
BG <- corpus(train)
rm(con, train, blog, x)
summary(BG, 5)
## Corpus consisting of 135000 documents, showing 5 documents:
##
## Text Types Tokens Sentences
## text1 28 34 2
## text2 23 26 1
## text3 75 105 4
## text4 49 70 1
## text5 15 16 2
##
## Source: /Users/Nuray/Desktop/CourseraProjects/capstone/* on x86_64 by Nuray
## Created: Sat Nov 9 19:10:03 2019
## Notes:
News text
con <- file("./en_US.news.txt", "r")
news = readLines(con, skipNul = T)
close(con)
# taking ~15% of the data for memory reasons
set.seed(10000)
x = sample(1010242, 150000, replace = F)
train = news[x]
NS <- corpus(train)
rm(con, train, news, x)
summary(NS, 5)
## Corpus consisting of 150000 documents, showing 5 documents:
##
## Text Types Tokens Sentences
## text1 16 17 1
## text2 12 16 2
## text3 5 5 1
## text4 22 22 1
## text5 59 80 2
##
## Source: /Users/Nuray/Desktop/CourseraProjects/capstone/* on x86_64 by Nuray
## Created: Sat Nov 9 19:10:24 2019
## Notes:
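The three blocks above repeat the same read / sample / corpus steps. As a side note, a small helper along these lines could remove the duplication (the function name and interface are my own suggestion, not part of the report's code):

```r
# Hypothetical helper consolidating the repeated read/sample/corpus steps
sampleCorpus <- function(path, n_lines, n_sample, seed) {
  con <- file(path, "r")
  txt <- readLines(con, skipNul = TRUE)
  close(con)
  set.seed(seed)
  # sample n_sample of the n_lines entries without replacement
  corpus(txt[sample(n_lines, n_sample, replace = FALSE)])
}
# e.g. TW <- sampleCorpus("./en_US.twitter.txt", 2360148, 350000, seed = 1)
```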
Here, for each corpus (BG, NS, TW), we create a document-feature matrix with the quanteda::dfm() function. The selected options are:
- remove_punct : removes punctuation from the text
- remove_numbers : removes numbers from the text
- remove = stopwords("english") : removes the English stopwords defined by quanteda
After creating the document-feature matrix, a wordcloud visualisation is created with the quanteda::textplot_wordcloud() function, for the words that appear more than 3000 times in total.
After the wordcloud visualisation, the most frequent unigrams are plotted as a bar chart with plotly.
1. Blog text
BG.uni = dfm(BG, ngrams = 1, remove_punct = T, remove_numbers = T,
remove = stopwords("english"))
set.seed(100)
textplot_wordcloud(BG.uni, min_count = 3000, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
rm(BG.uni)
The top frequencies of the unique words are shown below:
headBG = getTables(dat = BG, ng = 1)[order(freq, decreasing = T)][1:10,]
plot_ly(x = headBG$ngram, y = headBG$freq)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
rm(headBG)
2. News text
NS.uni = dfm(NS, ngrams = 1, remove_punct = T, remove_numbers = T,
remove = stopwords("english"))
set.seed(1)
textplot_wordcloud(NS.uni, min_count = 3000, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
rm(NS.uni)
The top frequencies of the unique words are shown below:
headNS = getTables(dat = NS, ng = 1)[order(freq, decreasing = T)][1:10,]
plot_ly(x = headNS$ngram, y = headNS$freq)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
rm(headNS)
3. Twitter text
TW.uni = dfm(TW, ngrams = 1, remove_punct = T, remove_numbers = T,
remove = stopwords("english"))
set.seed(12345)
textplot_wordcloud(TW.uni, min_count = 3000, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8, "Dark2"))
rm(TW.uni)
The top frequencies of the unique words are shown below:
headTW = getTables(dat = TW, ng = 1)[order(freq, decreasing = T)][1:10,]
plot_ly(x = headTW$ngram, y = headTW$freq)
## No trace type specified:
## Based on info supplied, a 'bar' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#bar
rm(headTW)
To turn the algorithm into an app, I first need to write functions that build the observed/unobserved trigram and bigram probability tables for a given phrase, using the Katz back-off algorithm. With these functions in place, I can calculate the probabilities and return the highest-probability next word for a given phrase.
For performance reasons, I will only use trigrams, bigrams, and unigrams.
Given a phrase, the algorithm will first check the trigram table for the most likely next word. If there is no probable answer, it will check the bigram table. If there is still no probable answer, it will fall back to the unigram table and predict the most common single word in the corpus.
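The back-off cascade described above can be sketched as follows. This is a simplified, frequency-based version (it does not yet apply the Katz discounting), and the function and table names are my own; it assumes the n-gram tables produced by getTables(), where quanteda joins n-gram tokens with "_":

```r
# Sketch of the trigram -> bigram -> unigram back-off lookup.
# tri_dt, bi_dt, uni_dt are data.tables with columns 'ngram' and 'freq',
# as returned by getTables() for ng = 3, 2, 1.
predictNext <- function(phrase, tri_dt, bi_dt, uni_dt) {
  w <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
  # 1. trigrams whose first two tokens match the last two words
  if (length(w) == 2) {
    hits <- tri_dt[startsWith(ngram, paste0(w[1], "_", w[2], "_"))]
    if (nrow(hits) > 0) return(sub(".*_", "", hits[which.max(freq), ngram]))
  }
  # 2. back off to bigrams starting with the last word
  hits <- bi_dt[startsWith(ngram, paste0(tail(w, 1), "_"))]
  if (nrow(hits) > 0) return(sub(".*_", "", hits[which.max(freq), ngram]))
  # 3. final fallback: the most frequent single word in the corpus
  uni_dt[which.max(freq), ngram]
}
```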